Method of generating components of a speech database using the speech synthesis technique
and machine for automatic speech recognition. 

The invention relates to a method of generating
components of a speech database using the speech 
synthesis technique and to a machine for automatic 
speech recognition. 

By asking a speaker to repeat speech elements on the basis
of a preceding series of emissions corresponding to
automatically synthesized utterances of said speech
elements, a database of better and more uniform quality is
obtained.












EP 0 642 116 A1 






The present invention relates to a method of 
generating components of a speech database us- 
ing the speech synthesis technique and to a ma- 
chine for automatic speech recognition. 

Basically, machines for speech recognition can be divided
into two categories. Machines of the first category are
based on conventional processors and perform recognition by
comparing the word to be recognized with the words of a
pre-established vocabulary. Machines of the second category
are based on special architectures, such as neural networks,
in which the pre-established vocabulary is determined by a
set of parameter values characterizing the architecture.
Hence the first machines require, for their operation, the
generation of a speech database corresponding to the
pre-established vocabulary, while the second require the
generation of a suitable set of parameter values
corresponding to the pre-established vocabulary, which can
be regarded as a distributed speech database.

As is known, the generation of such databases, whether
concentrated or distributed, requires long and repeated
recording operations. In general, a team of speakers is
selected and each speaker utters, a number of times, the
speech elements corresponding to a predetermined vocabulary
(generally words or, less frequently, syllables). The
acoustic signals corresponding to these utterances are
acquired and often tape-recorded. A processing step may
follow, consisting, e.g., of background noise filtering,
sampling and digitizing. Lastly, the actual generation of
the database is carried out. This may simply consist of
storing the signals in semiconductor memories according to a
pre-established format or, in addition to and before
storage, of generating suitable parameters, for instance LPC
(Linear Predictive Coding) parameters, from the acquired and
processed acoustic signals. In the case of neural networks,
the distributed database is generated by directly providing
the network (which will subsequently carry out the
recognition) with the acquired and processed acoustic
signals and by letting the network itself change the values
of its parameters during a step called "training".
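As an illustration of the parameter generation mentioned above, the following minimal sketch derives LPC parameters from one digitized frame. The text does not say how the parameters are computed; the autocorrelation method with the Levinson-Durbin recursion is assumed here as one common choice, and the function names are hypothetical.

```python
from typing import List, Sequence


def autocorrelation(frame: Sequence[float], max_lag: int) -> List[float]:
    """Return r[0..max_lag], with r[k] = sum over n of frame[n] * frame[n + k]."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(frame: Sequence[float], order: int) -> List[float]:
    """Levinson-Durbin recursion (assumed method, not taken from the source).

    Returns a[1..order] such that frame[n] is predicted as
    sum over k of a[k] * frame[n - k].
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order          # silent frame: nothing to predict
    a = [0.0] * (order + 1)           # a[0] is unused
    error = r[0]                      # prediction error energy
    for i in range(1, order + 1):
        if error <= 0.0:
            break                     # frame already perfectly predicted
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]
```

For a frame that decays geometrically, such as `[0.5 ** n for n in range(50)]`, a first-order analysis yields a single coefficient very close to 0.5, which is the ratio between successive samples.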

The word "training", when referred to speech recognition
machines of the first category, indicates an operative step
during which the concentrated database is enhanced with new
utterances of speech elements, whether or not they belong to
the predetermined vocabulary; not all such machines feature
a "training" step.

The generation of such databases must be carried out with
great care, since the recognition rate strongly depends on
the database used.

Essentially, there are two methods by which the speakers
can be prompted to utter the speech elements belonging to
the predetermined vocabulary.

The first one consists in providing each speaker with a
written list of the speech elements to be uttered. This
method has the disadvantage, particularly serious if the
speakers are not professionals, of leading to an unnatural
and erratic pronunciation: the speaker starts the utterance
with certain voice characteristics, such as high energy, a
markedly imperative prosody, rapidly spoken words separated
by long silences, and a clear, well-scanned articulation of
syllables, and ends it with lower energy, a more apathetic
(i.e. meaningless) prosody, slowly spoken words separated by
short silences, and a fluent articulation of syllables.
Moreover, such a method is not always applicable, as in the
case of acquiring speech databases over a telephone line,
where the speaker is chosen at random among the subscribers,
or in a car, where the driver cannot drive and read at the
same time.

The second method can be applied, e.g., in the cases just
mentioned and consists in asking the speaker to repeat the
speech elements uttered by another person, called the
"operator". It has been found that, with such a method, the
speaker's utterance is disadvantageously altered because the
speaker tends to "copy" the operator's pronunciation, more
precisely: the speed and energy of words and the
articulation of their components, the accentuation of
vowels, the prosody of words, emphasis, the cadence or
rhythm of syllables, and any personal characteristics of the
operator (e.g. dialect, emotional and physical
characteristics). If the operator or his personal
characteristics change, as frequently happens in prolonged
recording operations, a totally uneven database will be
obtained, which is a further disadvantage.

The object of the present invention is to overcome the
drawbacks of the known art.

This object is achieved through the method of generating a
component of a speech database as set forth in claim 1,
through the methods of generating and enhancing a speech
database as set forth in claims 8 and 9 respectively, and
through the machine for automatic speech recognition as set
forth in claim 10.

Further advantageous aspects of the present 
invention are set forth in the subclaims. 

By asking the speaker to repeat speech elements on the basis
of a preceding series of sound emissions corresponding to
automatically synthesized utterances of such speech
elements, a database of better and more uniform quality is
obtained.

If such synthetic utterances are particularly unnatural,
the copy effect is limited.

Moreover, by having a machine pace the sound emissions in a
pre-established manner, the list effect is limited.

Lastly, by having such synthetic utterances always
synthesized automatically in accordance with the same method
and the same synthesis parameters, the uniformity of the
database is considerably improved.

The present invention will be better understood from the
following description.

In accordance with the present invention, the 
method of generating a component of a speech 
database corresponding to the utterance of a 
speech element, comprises the steps of:

a) emitting an automatically synthesized utterance of the
speech element,

b) waiting for a speech acoustic signal cor- 
responding to an utterance of the speech ele- 
ment, and 

c) acquiring such speech acoustic signal. 
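
Steps a) to c) can be sketched as follows. This is only an illustrative outline: the `Signal` type, the `Component` record and the three callables (`synthesize`, `emit`, `wait_for_speech`) are hypothetical stand-ins for the synthesizer, loudspeaker and microphone, none of which is specified in the text.

```python
from dataclasses import dataclass
from typing import Callable, List

Signal = List[float]  # hypothetical stand-in for an acoustic signal


@dataclass
class Component:
    element: str    # the speech element (e.g. a word)
    signal: Signal  # the acquired speech acoustic signal


def generate_component(
    element: str,
    synthesize: Callable[[str], Signal],
    emit: Callable[[Signal], None],
    wait_for_speech: Callable[[], Signal],
) -> Component:
    prompt = synthesize(element)       # automatic synthesis of the element
    emit(prompt)                       # step a): emit the synthesized utterance
    signal = wait_for_speech()         # step b): block until the speaker answers
    return Component(element, signal)  # step c): acquire the signal
```

Passing the devices in as callables keeps the sketch independent of any particular audio hardware or synthesis technique.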

Such a synthesized utterance can be synthesized starting
from a text, or from a natural pronunciation modified so as
to be unnatural, in particular prosodically unnatural.

Such acquired speech acoustic signal concep- 
tually corresponds to the desired component of the 
speech database. 

The word "conceptually" is used because the component
assumes very different forms depending on the circumstances.
For instance, in the case of a neural network speech
database, it will be formed by some values of the network
parameters, or by some variations of them, which are not
related in a simple way to the acquired acoustic signal and
are calculated automatically by the network itself according
to a predetermined algorithm.

Such acquired speech acoustic signal is then processed:
sometimes by simple sampling and digitizing, sometimes
through complicated digital coding algorithms, and sometimes
through analog operations.

The generation of a speech database component can be
carried out either prior to the speech recognition step, by
purpose-built equipment, or during that step, by the speech
recognition machine itself.

In the first case, the acquisition step is almost always
overseen by an operator who takes care at least of the
recording operations and who is able to recognize the
occurrence of anomalous conditions and remedy them.

In the second case in particular, it is advantageous that
the method further comprises the step of

d) verifying that the acquired speech acoustic signal
corresponds to an utterance of the speech element, e.g.,
through comparison with the synthesized utterance.

Should this verification be unsuccessful, such anomalous
condition can be signalled.



Moreover, it may be advisable to provide that, if step b)
lasts longer than a predetermined period of time, step c)
does not take place; the generation of the component has
then failed and such anomalous condition can be signalled.

Such signals are useful to the speaker who is uttering the
speech elements of a vocabulary, so that he can remedy the
anomalous conditions.
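
The time limit on step b) and the verification of step d) can be sketched together as below. The names are hypothetical: `wait_for_speech` is assumed to return `None` when the predetermined period elapses, and `matches` stands for whatever comparison with the synthesized utterance is chosen, which the text leaves open.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple

Signal = List[float]  # hypothetical stand-in for an acoustic signal


class Outcome(Enum):
    ACQUIRED = "acquired"
    TIMEOUT = "timeout"    # step b) exceeded the predetermined period
    MISMATCH = "mismatch"  # step d) verification was unsuccessful


def acquire_with_checks(
    prompt: Signal,
    emit: Callable[[Signal], None],
    wait_for_speech: Callable[[float], Optional[Signal]],
    matches: Callable[[Signal, Signal], bool],
    timeout_s: float,
) -> Tuple[Outcome, Optional[Signal]]:
    emit(prompt)                         # step a)
    signal = wait_for_speech(timeout_s)  # step b); None signals a timeout
    if signal is None:
        return Outcome.TIMEOUT, None     # step c) does not take place
    # step c) has taken place; step d) compares with the synthesized prompt
    if not matches(signal, prompt):
        # The text only says the condition can be signalled; what to do
        # with the signal is left to the caller in this sketch.
        return Outcome.MISMATCH, signal
    return Outcome.ACQUIRED, signal
```

Returning an explicit outcome lets the calling machine signal the anomalous condition to the speaker in whatever form suits it.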

The fact that the utterance is synthetic can be
advantageously exploited in two different ways. If the
pronunciation is highly unnatural, the speaker will be
neither able nor inclined to "copy" it (consequently the
synthesizer will be very easy to realize and cheaper). If,
on the other hand, the synthetic pronunciation is fairly
natural, the speaker will of course be inclined to copy it;
it is therefore conceivable to identify, through laboratory
tests, those values of the synthesis parameters which elicit
an "ideal" pronunciation from the speaker, that is to say
the one which provides the best results during the
recognition step. In any case, the possibility of varying
the synthesis parameters advantageously allows indirect
control of the speaker's pronunciation.

Step a) can be preceded by a step of retrieving the
synthetic pronunciation from storage means, or by a step of
automatically synthesizing such pronunciation starting from
the corresponding speech element.

Methods of generating and enhancing speech databases can be
derived from the method, just described, of generating a
single component of a speech database.

In accordance with the present invention, the method of
generating a speech database corresponding to a
predetermined vocabulary comprising a plurality of speech
elements provides that the steps of the method just
described are repeated at least once for each speech element
of the vocabulary.

According to the present invention, the method of enhancing
a speech database corresponding to utterances of speech
elements belonging to a predetermined vocabulary, through
new utterances of speech elements whether or not belonging
to such vocabulary, provides that the steps of the method
just described are repeated for each new utterance.
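
The two derived methods amount to repeating the single-component procedure, which might be sketched as follows. Here `acquire` is a hypothetical callable wrapping steps a) to c) for one speech element, and the dictionary layout of the database is an assumption made for illustration only.

```python
from typing import Callable, Dict, Iterable, List

Signal = List[float]                  # hypothetical acoustic signal type
Database = Dict[str, List[Signal]]    # assumed layout: element -> utterances


def generate_database(
    vocabulary: Iterable[str],
    acquire: Callable[[str], Signal],
    repetitions: int = 1,
) -> Database:
    """Repeat the component method at least once per speech element."""
    if repetitions < 1:
        raise ValueError("each element must be acquired at least once")
    return {
        element: [acquire(element) for _ in range(repetitions)]
        for element in vocabulary
    }


def enhance_database(
    database: Database,
    new_elements: Iterable[str],
    acquire: Callable[[str], Signal],
) -> None:
    """Add one new utterance per entry; elements may be outside the vocabulary."""
    for element in new_elements:
        database.setdefault(element, []).append(acquire(element))
```

Because `enhance_database` creates an entry on demand, it covers both new utterances of known elements and utterances of elements not yet in the vocabulary.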

It has been explained how machines for automatic speech
recognition provide a "training" operative step during
which the speech database corresponding to a recognition
vocabulary of speech elements is generated or enhanced.

The machine for automatic speech recognition in accordance
with the present invention is designed so that, during the
training step, it arranges itself to carry out the steps of
the method described above.

In a particularly simple embodiment, such machine comprises
storage means capable of containing automatically
synthesized utterances of speech elements of the recognition
vocabulary.

In this circumstance, it will be enough to read out the
various utterances from such storage means, emit them one at
a time and wait for the speaker to repeat them.

Such storage means may coincide with those 
for containing the speech database. 

Starting from an extremely simple initial speech database
having only one component for each speech element of the
vocabulary, such database can be enlarged by using the
utterances of the initial database as synthetic utterances:
if they come from automatic synthesis operations they can be
emitted without further processing, otherwise they can be
modified in such a way as to sound unnatural, as already
said.
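
The choice of prompt when enlarging an initial database can be sketched as below. The `StoredUtterance` record and its `synthetic` flag are assumptions, and `make_unnatural` is a hypothetical caller-supplied transformation, since the text does not say how a natural pronunciation is to be modified.

```python
from dataclasses import dataclass
from typing import Callable, List

Signal = List[float]  # hypothetical stand-in for an acoustic signal


@dataclass
class StoredUtterance:
    element: str
    signal: Signal
    synthetic: bool  # True if the utterance came from automatic synthesis


def prompt_for(
    stored: StoredUtterance,
    make_unnatural: Callable[[Signal], Signal],
) -> Signal:
    """Choose the signal to emit as the prompt for one speech element."""
    if stored.synthetic:
        return stored.signal              # emitted without further processing
    return make_unnatural(stored.signal)  # natural recording: modify it first
```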

In a second embodiment, such machine comprises an automatic
speech synthesizer capable of synthesizing and emitting
utterances of speech elements, including ones not comprised
in the recognition vocabulary.

The recognition vocabulary is chosen during 
production; it may be contemplated that the pur- 
chaser personalizes the vocabulary by adding per- 
sonal speech elements. 

Such personal speech elements may, e.g., be 
introduced into such machine in a textual form and 
synthesized and emitted by the synthesizer during 
the training step. 

Claims 

1. Method of generating a component of a 
speech database corresponding to the utter- 
ance of a speech element, comprising the 
steps of : 

a) emitting an automatically synthesized ut- 
terance of said speech element, 

b) waiting for a speech acoustic signal cor- 
responding to an utterance of said speech 
element, and 

c) acquiring said speech acoustic signal; 
whereby such acquired speech acoustic 
signal conceptually corresponds to said 
component. 

2. Method according to claim 1, characterized in 
that said acquired speech acoustic signal is 
processed. 

3. Method according to claim 1, characterized in
that it further comprises the step of

d) verifying that said acquired speech
acoustic signal corresponds to an utterance
of said speech element through comparison
with said synthesized utterance.



4. Method according to claim 1, characterized in
that, if said step b) is longer than a predetermined
period of time, said step c) does not take place and
such anomalous condition is signalled.

5. Method according to claim 1, characterized in 
that said synthesized utterance is of unnatural 
type. 


6. Method according to claim 1, characterized in 
that said step a) is preceded by a step for 
extracting said utterance from storage means. 

7. Method according to claim 1, characterized in
that said step a) is preceded by a step of
automatically synthesizing said utterance starting
from the corresponding speech element.

8. Method of generating a speech database corresponding
to a predetermined vocabulary comprising a plurality of
speech elements, characterized in that said steps of the
method of claim 1 are repeated at least once for each
speech element of said vocabulary.

9. Method of enhancing a speech database corresponding
to utterances of speech elements belonging to a
predetermined vocabulary, through new utterances of
speech elements whether or not belonging to said
vocabulary, characterized in that the steps of the
method of claim 1 are repeated for each new utterance.

10. Machine for the automatic recognition of speech in
relation to a predetermined recognition vocabulary of
speech elements, characterized in that, during the step
of training, it is capable of arranging itself in such a
way as to realize the steps of the method of claim 1.

11. Machine according to claim 10, characterized
in that it comprises storage means designed to
contain automatically synthesized utterances of
speech elements of said recognition vocabulary.

12. Machine according to claim 10, characterized
in that it comprises an automatic speech synthesizer
designed to synthesize and emit utterances of speech
elements even if not comprised in said recognition
vocabulary.